Wilcoxon Signed-Rank Test (paired, nonparametric)#

The Wilcoxon signed-rank test answers a common question:

When you measure the same subjects twice (before/after, left/right, pre/post), is there evidence of a systematic shift?

It is a nonparametric alternative to the paired t-test.

  • It works on the paired differences and uses ranks (so it’s more robust to outliers than a mean-based test).

  • It tests a location shift of the differences (often described as a test about the median difference, under a symmetry assumption).
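The robustness claim above is easy to see with toy numbers (hypothetical, not from the dataset used later): one wild outlier drags the mean far from the bulk of the data, while its influence on any rank-based statistic is capped at the largest rank.

```python
import numpy as np

# Toy differences: four modest values plus one wild outlier
d = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

ranks = np.empty(d.size)
ranks[np.argsort(np.abs(d))] = np.arange(1, d.size + 1)

print(d.mean())   # 22.0 -- the mean is dragged far above the bulk of the data
print(ranks[-1])  # 5.0  -- the outlier's influence is capped at rank m = 5
```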


Learning goals#

By the end you should be able to:

  • state the hypotheses and when the test is appropriate

  • compute the signed-rank statistic by hand (conceptually)

  • implement the test from scratch using only NumPy

  • visualize the test statistic under the null (exactly and via simulation)

  • interpret the p-value, direction, and an effect size


Prerequisites#

  • paired data / dependent samples

  • basic hypothesis testing (null/alternative, p-values)

  • familiarity with ranks (sorting, ties)

import math
import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=6, suppress=True)
rng = np.random.default_rng(42)

1) What the test is (and when to use it)#

Setting#

You have paired observations ((x_i, y_i)) for (i=1,\dots,n).

Examples:

  • the same user’s latency before vs after an optimization

  • blood pressure pre vs post treatment

  • left vs right measurement on the same subject

Define paired differences:

[ d_i = x_i - y_i ]

Hypotheses (typical)#

  • Two-sided: (H_0): the distribution of (d) is symmetric around 0 (no location shift) vs (H_1): not symmetric around 0.

  • One-sided: (H_0) vs (H_1: \text{location}(d) > 0) or (< 0).

Under the symmetry assumption, the test is commonly described as:

[ H_0: \text{median}(d) = 0 ]

When it’s a good fit#

Use Wilcoxon signed-rank when:

  • samples are paired / dependent

  • differences are at least ordinal (you can rank magnitudes)

  • you don’t want to rely on normality of differences (paired t-test assumption)

When it’s not the right tool#

  • if the samples are independent (use Mann–Whitney U / Wilcoxon rank-sum instead)

  • if differences are wildly asymmetric (the test targets a “shift” under symmetry; consider permutation tests or other robust approaches)

  • if you have many exact zeros (the sign test might be a better match)
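For context, the sign test mentioned above uses only the signs of the non-zero differences, so exact zeros cost it less information. A minimal sketch (an illustrative helper, not part of this notebook's pipeline) using the exact binomial tail:

```python
from math import comb

def sign_test_p_greater(diffs):
    """One-sided exact sign test: P(#positives >= observed) when each sign is a fair coin."""
    nz = [x for x in diffs if x != 0]          # drop exact zeros, as with Wilcoxon
    n, k = len(nz), sum(x > 0 for x in nz)
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

# 8 positives out of 10 non-zero differences (toy numbers)
p = sign_test_p_greater([1, 2, 3, 4, 5, 6, 7, 8, -1, -2, 0])
print(p)  # 0.0546875
```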

# Synthetic paired data: a small location shift with heavy-tailed noise
n = 22
before = rng.normal(50, 10, size=n)

# Heavy-tailed differences (outliers are plausible)
diff = rng.standard_t(df=3, size=n) * 2.0 + 2.0

# Add a couple of exact zeros to demonstrate the common convention (drop zeros)
diff[[3, 15]] = 0.0

after = before + diff

d = after - before
n, d[:8]
(22,
 array([ 4.983304,  1.364337,  2.776162,  0.      ,  2.892147, -0.95964 ,
         3.335939,  2.499282]))
# Paired plot: each line is one subject
x = np.ravel(np.column_stack([np.zeros(n), np.ones(n), np.full(n, np.nan)]))
y = np.ravel(np.column_stack([before, after, np.full(n, np.nan)]))

fig = go.Figure(
    go.Scatter(
        x=x,
        y=y,
        mode="lines+markers",
        line=dict(color="rgba(31, 119, 180, 0.35)", width=2),
        marker=dict(size=7, color="rgba(31, 119, 180, 0.85)"),
        showlegend=False,
    )
)
fig.update_layout(
    title="Paired measurements (before → after)",
    xaxis=dict(tickmode="array", tickvals=[0, 1], ticktext=["before", "after"], range=[-0.15, 1.15]),
    yaxis_title="value",
)
fig.show()

Intuition: we care about the direction and magnitude of differences#

The signed-rank test keeps the sign of each difference (positive/negative) and uses the rank of its magnitude.

  • A plain sign test uses only the signs.

  • A paired t-test uses magnitudes but through the mean (more sensitive to outliers).

Wilcoxon signed-rank sits in between: it’s rank-based (robust) but still rewards consistently large differences.
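A small illustration of that "in between" position, with toy differences (no ties, no zeros): both vectors have the same signs, so a sign test cannot distinguish them, but the signed-rank sum rewards the case where the positive differences are also the large ones.

```python
import numpy as np

def w_plus(d):
    """Sum of ranks of |d| over positive differences (assumes no ties, no zeros)."""
    r = np.empty(d.size)
    r[np.argsort(np.abs(d))] = np.arange(1, d.size + 1)
    return r[d > 0].sum()

d1 = np.array([1.0, 2.0, 3.0, -4.0])  # the positives are the *small* magnitudes
d2 = np.array([2.0, 3.0, 4.0, -1.0])  # the positives are the *large* magnitudes

# A sign test sees 3-of-4 positives in both cases; the signed-rank sum does not:
print(w_plus(d1), w_plus(d2))  # 6.0 9.0
```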

# Distribution of paired differences
fig = px.histogram(
    x=d,
    nbins=16,
    title="Paired differences (after - before)",
    labels={"x": "difference"},
)
fig.add_vline(x=float(np.median(d)), line_dash="dash", line_color="black", annotation_text="median")
fig.show()

fig = px.violin(
    y=d,
    box=True,
    points="all",
    title="Differences with individual points",
    labels={"y": "difference"},
)
fig.add_hline(y=0, line_dash="dot", line_color="gray")
fig.show()

2) The signed-rank statistic (how it’s computed)#

Given differences (d_1,\dots,d_n):

  1. Drop zeros: keep only (d_i \neq 0). Let (m) be the number of remaining differences.

  2. Compute (|d_i|) and rank them from smallest to largest.

    • If there are ties in (|d_i|), assign average ranks.

  3. Sum ranks for positive differences:

[ W^+ = \sum_{i: d_i > 0} \mathrm{rank}(|d_i|) ]

  4. Sum ranks for negative differences:

[ W^- = \sum_{i: d_i < 0} \mathrm{rank}(|d_i|) ]

For a two-sided test, a common statistic is:

[ W = \min(W^+, W^-) ]

  • If there’s no shift, positives and negatives should be “balanced”, so (W^+) and (W^-) should be similar.

  • If the true shift is positive, (W^+) tends to be large (and (W^-) small).
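As a quick sanity check, here are the steps above on a toy vector of differences (hypothetical numbers with no zeros and no ties):

```python
import numpy as np

d = np.array([1.2, -0.5, 3.1, -2.0, 0.7])  # toy differences: no zeros, no ties

ranks = np.empty(d.size)
ranks[np.argsort(np.abs(d))] = np.arange(1, d.size + 1)  # rank 1 = smallest |d|

w_plus = ranks[d > 0].sum()   # ranks of 1.2, 3.1, 0.7 -> 3 + 5 + 2 = 10
w_minus = ranks[d < 0].sum()  # ranks of -0.5, -2.0   -> 1 + 4 = 5

# Sanity check: W+ + W- always equals the total rank sum m(m+1)/2
print(w_plus, w_minus, d.size * (d.size + 1) / 2)  # 10.0 5.0 15.0
```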


What makes it “exact” under (H_0)?#

Under (H_0), each non-zero difference is equally likely to be positive or negative (a random sign).

So the null distribution of (W^+) is the distribution of a random subset sum of the ranks.
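A minimal sketch of this "random subset sum" idea: for small (m) we can enumerate all (2^m) sign patterns directly (here (m=3), ranks ({1,2,3})) and tabulate the resulting PMF of (W^+).

```python
from itertools import combinations

# Brute-force null PMF of W+ for m = 3: each rank is included in W+ with prob 1/2
m = 3
counts = {}
for k in range(m + 1):
    for subset in combinations(range(1, m + 1), k):
        s = sum(subset)
        counts[s] = counts.get(s, 0) + 1

pmf = {s: c / 2**m for s, c in sorted(counts.items())}
print(pmf)  # {0: 0.125, 1: 0.125, 2: 0.125, 3: 0.25, 4: 0.125, 5: 0.125, 6: 0.125}
```

The same distribution falls out of the subset-sum dynamic program implemented below, which scales to much larger (m) than full enumeration.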

def rankdata_average(a: np.ndarray) -> np.ndarray:
    'Rank data with average ranks for ties (like scipy.stats.rankdata(method="average")).'
    a = np.asarray(a)
    if a.ndim != 1:
        raise ValueError("rankdata_average expects a 1D array")
    n = a.size
    if n == 0:
        return a.astype(float)

    sorter = np.argsort(a, kind="mergesort")  # stable
    a_sorted = a[sorter]

    # Tie blocks in sorted order
    _, first_idx, counts = np.unique(a_sorted, return_index=True, return_counts=True)

    ranks_sorted = np.empty(n, dtype=float)
    for start, count in zip(first_idx, counts):
        end = start + count
        # 1-based ranks: start_rank = start+1, end_rank = end
        avg_rank = 0.5 * ((start + 1) + end)
        ranks_sorted[start:end] = avg_rank

    ranks = np.empty(n, dtype=float)
    ranks[sorter] = ranks_sorted
    return ranks


def signed_rank_components(d: np.ndarray):
    'Compute ranks and W+/W- for differences d, dropping zeros.'
    d = np.asarray(d, dtype=float)
    d = d[np.isfinite(d)]

    d_nz = d[d != 0]
    abs_d = np.abs(d_nz)
    ranks = rankdata_average(abs_d)

    w_plus = float(ranks[d_nz > 0].sum())
    w_minus = float(ranks[d_nz < 0].sum())

    return {
        "d": d_nz,
        "abs_d": abs_d,
        "ranks": ranks,
        "w_plus": w_plus,
        "w_minus": w_minus,
        "m": int(d_nz.size),
    }


parts = signed_rank_components(d)
parts["m"], parts["w_plus"], parts["w_minus"]
(20, 180.0, 30.0)
# Visual: each bar is one |difference| rank, colored by sign
ranks = parts["ranks"]
d_nz = parts["d"]

order = np.argsort(parts["abs_d"])

colors = np.where(d_nz[order] > 0, "#2ca02c", "#d62728")
labels = [f"#{i+1}" for i in range(parts["m"])]

fig = go.Figure(
    go.Bar(
        x=labels,
        y=ranks[order],
        marker_color=colors,
        hovertext=[
            f"d={d_nz[j]:.3f}<br>|d|={abs(d_nz[j]):.3f}<br>rank={ranks[j]:.1f}"
            for j in order
        ],
        hoverinfo="text",
    )
)
fig.update_layout(
    title="Signed-rank building blocks (green = positive difference, red = negative)",
    xaxis_title="differences sorted by |d|",
    yaxis_title="rank(|d|)",
)
fig.show()

3) A NumPy-only implementation (including p-values)#

We’ll implement:

  • average ranks for ties

  • the signed-rank sums (W^+), (W^-)

  • exact p-values when there are no ties (small/medium (m)) via subset-sum DP

  • a normal approximation fallback (with optional continuity correction)

This mirrors the logic used in standard statistical libraries, but keeps the steps explicit.

def _normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def wilcoxon_exact_pmf_no_ties(m: int) -> np.ndarray:
    'PMF of W+ under H0 when ranks are exactly 1..m (no ties, no zeros).'
    if m < 0:
        raise ValueError("m must be non-negative")
    total = m * (m + 1) // 2

    counts = np.zeros(total + 1, dtype=np.int64)
    counts[0] = 1

    # Subset-sum DP: counts[s] = number of subsets of {1..m} that sum to s
    for r in range(1, m + 1):
        counts[r:] += counts[:-r]

    denom = 2 ** m
    return counts.astype(float) / float(denom)


def wilcoxon_signed_rank(
    x: np.ndarray,
    y: np.ndarray | None = None,
    *,
    mu0: float = 0.0,
    alternative: str = "two-sided",
    method: str = "auto",  # auto | exact | normal
    correction: bool = True,
):
    'Wilcoxon signed-rank test, implemented with NumPy (educational).'

    x = np.asarray(x, dtype=float)
    if y is None:
        d = x - mu0
    else:
        y = np.asarray(y, dtype=float)
        if x.shape != y.shape:
            raise ValueError("x and y must have the same shape")
        d = (x - y) - mu0

    parts = signed_rank_components(d)
    m = parts["m"]
    if m == 0:
        raise ValueError("All differences are zero (after dropping zeros)")

    d_nz = parts["d"]
    ranks = parts["ranks"]
    w_plus = parts["w_plus"]
    w_minus = parts["w_minus"]
    total_rank_sum = float(ranks.sum())

    has_ties = np.unique(parts["abs_d"]).size != m

    if alternative not in {"two-sided", "greater", "less"}:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

    if method == "auto":
        method_ = "exact" if (not has_ties and m <= 30) else "normal"
    else:
        method_ = method

    if method_ == "exact":
        if has_ties:
            raise ValueError("Exact p-values require no ties in |d|. Use method='normal' or method='auto'.")

        probs = wilcoxon_exact_pmf_no_ties(m)
        w_plus_int = int(round(w_plus))

        cdf = float(probs[: w_plus_int + 1].sum())
        sf = float(probs[w_plus_int:].sum())

        if alternative == "greater":
            p_value = sf
        elif alternative == "less":
            p_value = cdf
        else:
            p_value = min(1.0, 2.0 * min(cdf, sf))

        z = None

    elif method_ == "normal":
        mu = 0.5 * total_rank_sum
        sigma2 = 0.25 * float(np.sum(ranks**2))
        sigma = math.sqrt(sigma2)

        w_cc = w_plus
        if correction:
            delta = w_plus - mu
            if delta > 0:
                w_cc = w_plus - 0.5
            elif delta < 0:
                w_cc = w_plus + 0.5

        z = (w_cc - mu) / sigma

        if alternative == "greater":
            p_value = 1.0 - _normal_cdf(z)
        elif alternative == "less":
            p_value = _normal_cdf(z)
        else:
            p_value = 2.0 * min(_normal_cdf(z), 1.0 - _normal_cdf(z))

    else:
        raise ValueError("method must be 'auto', 'exact', or 'normal'")

    rank_biserial = (w_plus - w_minus) / total_rank_sum  # in [-1, 1]
    median_diff = float(np.median(d_nz))

    return {
        "m": m,
        "w_plus": float(w_plus),
        "w_minus": float(w_minus),
        "statistic": float(min(w_plus, w_minus)) if alternative == "two-sided" else float(w_plus),
        "p_value": float(p_value),
        "alternative": alternative,
        "method": method_,
        "median_diff": median_diff,
        "rank_biserial": float(rank_biserial),
        "z": None if z is None else float(z),
    }


res = wilcoxon_signed_rank(after, before, alternative="greater", method="auto")
res
{'m': 20,
 'w_plus': 180.0,
 'w_minus': 30.0,
 'statistic': 180.0,
 'p_value': 0.001827239990234375,
 'alternative': 'greater',
 'method': 'exact',
 'median_diff': 1.7795950541624563,
 'rank_biserial': 0.7142857142857143,
 'z': None}
# Practical usage: compare with SciPy
from scipy import stats

# Note: because the data contain exact zeros, SciPy may fall back to a normal
# approximation here (depending on version), so its p-value can differ
# slightly from the exact value computed above.
scipy_res = stats.wilcoxon(after, before, alternative="greater", zero_method="wilcox", method="auto")
scipy_res
scipy_res
WilcoxonResult(statistic=180.0, pvalue=0.0025555243609691396)

4) Visualizing the null distribution and the p-value#

For no ties and moderate (m), we can compute the exact null distribution of (W^+).

Below, we plot the PMF of (W^+) under (H_0) and highlight the observed tail probability.

# Exact null distribution (only valid when there are no ties in |d|)
parts = signed_rank_components(d)
m = parts["m"]
has_ties = np.unique(parts["abs_d"]).size != m

if has_ties:
    print("This dataset has ties in |d|, so an exact PMF is not shown.")
else:
    probs = wilcoxon_exact_pmf_no_ties(m)
    sums = np.arange(probs.size)

    w_plus = int(round(parts["w_plus"]))
    tail_mask = sums >= w_plus
    tail_probs = np.where(tail_mask, probs, 0.0)

    fig = go.Figure()
    fig.add_trace(go.Bar(x=sums, y=probs, name="PMF", marker_color="rgba(31, 119, 180, 0.35)"))
    fig.add_trace(go.Bar(x=sums, y=tail_probs, name="tail", marker_color="rgba(214, 39, 40, 0.65)"))

    fig.add_vline(x=w_plus, line_dash="dash", line_color="black", annotation_text=f"W+ = {w_plus}")

    p_tail = float(probs[tail_mask].sum())

    fig.update_layout(
        title=f"Exact null distribution of W+ (m={m}) — one-sided tail p={p_tail:.4f}",
        xaxis_title="W+",
        yaxis_title="P(W+ = w)",
        barmode="overlay",
    )
    fig.show()
# Monte Carlo view: random sign-flips on the observed ranks (works with ties)
parts = signed_rank_components(d)
ranks = parts["ranks"]
w_plus_obs = parts["w_plus"]

n_sims = 50_000
signs = rng.choice([-1, 1], size=(n_sims, ranks.size))
w_plus_sim = (ranks * (signs > 0)).sum(axis=1)

fig = px.histogram(
    w_plus_sim,
    nbins=40,
    title="Null distribution by sign-flip simulation (Monte Carlo)",
    labels={"value": "W+"},
)
fig.add_vline(x=w_plus_obs, line_dash="dash", line_color="black", annotation_text="observed W+")
fig.show()

p_mc_greater = float(np.mean(w_plus_sim >= w_plus_obs))
print(f"Monte Carlo one-sided p (greater): {p_mc_greater:.4f}")
Monte Carlo one-sided p (greater): 0.0017

5) Interpreting the result#

What the p-value means#

A p-value is always a statement assuming the null hypothesis is true.

For Wilcoxon signed-rank (paired):

  • (H_0) says there is no systematic shift in the paired differences (symmetry around 0).

  • The p-value is the probability of seeing a signed-rank sum at least as extreme as observed if signs were effectively random.

Direction and “how big is it?”#

The test answers “is there evidence of a shift?”, not “how large is the shift?”.

To communicate size + direction, report alongside the p-value:

  • median difference (robust location summary)

  • an effect size like rank-biserial correlation:

[ \text{RBC} = \frac{W^+ - W^-}{W^+ + W^-} \in [-1, 1] ]

Interpretation (rule of thumb):

  • RBC near +1: most rank mass is on the positive side (x > y)

  • RBC near -1: most rank mass is on the negative side (x < y)

  • RBC near 0: balanced positives and negatives

res = wilcoxon_signed_rank(after, before, alternative="greater", method="auto")

alpha = 0.05
print(f"m (non-zero diffs): {res['m']}")
print(f"W+ = {res['w_plus']:.3f}, W- = {res['w_minus']:.3f}")
print(f"p-value ({res['method']}, {res['alternative']}): {res['p_value']:.6f}")
print(f"median(after - before) = {res['median_diff']:.3f}")
print(f"rank-biserial correlation = {res['rank_biserial']:.3f}")

if res["p_value"] < alpha:
    print(f"\nDecision @ α={alpha}: reject H0 (evidence of a positive shift).")
else:
    print(f"\nDecision @ α={alpha}: fail to reject H0 (insufficient evidence of a positive shift).")
m (non-zero diffs): 20
W+ = 180.000, W- = 30.000
p-value (exact, greater): 0.001827
median(after - before) = 1.780
rank-biserial correlation = 0.714

Decision @ α=0.05: reject H0 (evidence of a positive shift).

6) Pitfalls and diagnostics#

  • Pairedness matters: Wilcoxon signed-rank is for paired samples. If your samples are independent, use a different test.

  • Symmetry of differences: the signed-rank test is designed for a location shift under (approximate) symmetry of (d).

    • If (d) is highly skewed, consider a permutation test on a robust statistic or model the differences directly.

  • Zeros and ties:

    • Many exact zeros weaken the test (less information). Be explicit about the convention (drop zeros vs other treatments).

    • Many ties reduce the effective resolution; exact p-values become trickier.

  • Statistical vs practical significance: a tiny shift can be “significant” with large (n); always report an effect size.

7) Exercises#

  1. Simulate paired data where the paired t-test rejects but Wilcoxon does not (and vice versa). What changed?

  2. Construct a dataset with many zeros. Compare:

    • Wilcoxon signed-rank (drop zeros)

    • sign test (count positives vs negatives)

  3. Implement a two-sided exact p-value visualization (both tails) and compare it to SciPy’s alternative='two-sided'.

References#

  • SciPy docs: scipy.stats.wilcoxon

  • Nonparametric statistics textbooks / lecture notes on signed-rank tests